/* copyright 2009 databaseguy.com; all rights reserved */

Lost in translation: what really happens when you run that query

Translation = database developers and administrators (we'll call them DBDEVs for convenience) <--> application and service developers (APPDEVs)

Best DBDEVs know a little about the life of a APPDEV (give personal examples).
Best APPDEVs need to understand something about the DBDEV world too.

AUDIENCE:
   * APPDEVs who write code that accesses database servers
   * DBDEVs who work with databases that are accessed by apps
   * This audience scope isn't so much *who* I am talking to as *what* I am talking about

What about me (speaker info)

DBDEVs spend their whole day thinking about indexes, access paths, query optimization, set-based logic, cardinality of data, etc.
APPDEVs have a lot to work with besides just the database, so it's tempting to view the database as "just one more service call"

To an extent, that assessment is true!  When a DBDEV is doing their job well, their stuff is ready on time (meaning before the APPDEVs need it) and "it just works."
However, what happens in the database is fundamentally different from, say, what happens in a non-database Web service you call.

For APPDEVs, my goal is that by the end of this presentation, you will understand some of those fundamental differences better than you do now, so that you will be better equipped to either work with the database yourself, or work with your friendly neighborhood DBA / DBDEV.

For DBDEVs, my goal is that during this presentation, I will either re-affirm what you already know in a fresh way, or maybe even show you a new thing or two.  Also, if I do my job well, you may understand that the APPDEV world a little better, so that maybe you will be better equipped to write client code yourself, or work with your friendly neighborhood APPDEV.

What this talk is NOT:
This is not a presentation on how to tune individual database queries.
However, I will be giving you a framework for thinking about database queries that will allow you to more intelligently apply what you do know.
I will also show you tools that help you measure some of the things we will be talking about.

High-level outline of where we are going:

Overview of query processing:

   * PARSE
   * NORMALIZE / SIMPLIFY / ALGEBRIZE
   * OPTIMIZE 
   * COMPILE
   * EXECUTE

The first two, collectively, are sometimes called "parsing."
The first four, collectively, are sometimes called "compilation" or "optimization."

Diving deeper:

PARSING

Sure, it can take a lot longer to do an in-memory sort.
Sure, it can take a long time to send out a large result set.
But just reading the query?  How hard could that be?

Anybody ever write a parser?  Pound for pound, ounce for ounce, this might be the one of the hardest things SQL Server has to do.

S      -- what's the query?
SE     -- what's the query?
SEL    -- what's the query?
SELECT -- what's the query?

SELECT   *
FROM     Production.Product AS p
JOIN     Production.ProductModel AS pm
         ON p.ProductModelID = pm.ProductModelID
            AND pm.Name = 'HL Road Frame'
WHERE    p.Size = 58
         AND p.SizeUnitMeasureCode = 'CM'
ORDER BY p.ProductID
OPTION   ( MAXDOP 1 ) ;

Obviously it takes some code to pick this apart and turn it into something that SQL Server can run.

PARSING:  
  * break the statement into tokens.
  * build initial parse tree.
  * validate syntax (not object names).

So this will parse as valid even if dbo.BogusObjectName does not exist:

  SELECT * FROM dbo.BogusObjectName

So will this:

  SELECT * FROM dbo.ActualProcName

This will not parse as valid even if dbo.ActualObjectName exists:

  SELECT * dbo.ActualObjectName

After parsing successfully, the normalizer (and a component called the algebrizer) kicks in.

NORMALIZING (SIMPLIFYING) and ALGEBRIZATION:
* validate that all referenced objects exist.
* bind IDs in place of names.
* do some simplifications.
* what's wrong with this statement?
* SET @x = 3*1 + 3*2 + 3*3;

  * validate that the query makes sense semantically.
  * expand view references into their definitions (usually).
  * and so forth.

OPTIMIZATION
* Try to find a cheap way to run the query.
* Not always trying to find the CHEAPEST way.

  * First attempt: trivial optimization
    * this is one last attempt to avoid running a full cost-based optimization
    * this optimization will succeed if
      * there is only one viable plan
        - example INSERT INTO dbo.mytable (x, y, z) VALUES (1, 2, 3)
      * a safe plan is available which must be nearly optimal
        - example SELECT * FROM dbo.mytable WHERE uniqueId = 1

  * If we really have to: full optimization
    * Query analysis is performed to see if there are optimizable (sargable) search arguments
    * Optimizer considers all useful indexes that can be used to optimize sargable expressions
      * Index statistics are used for this.
        - Index structure is outside today's scope.
        - For today's discussion, an index is a fast way to use a key to return either a row or a pointer to a row.
        - There are two kinds of pointers, which is also outside today's scope (Row ID and Clustered Key).
        - If you have to follow a pointer to go get the rest of the row, this is called a Key Lookup, RID Lookup, or Clustered Index Seek
          + and those can be costly
    * Similar analysis is performed for all feasible join methods, aggregation methods, etc.
  
  * Whether optimization is TRIVIAL or FULL, the output is a set of plan steps (iterators) and a dependency tree showing how they relate to each other.
  * This is what you see when you use the graphical showplan feature in SQL Server Management Studio
    * you can tell if trivial optimization has been used by looking at graphical showplan output
      -- demo that --

COMPILATION
  * Take all of that and turn it into something SQL Server can run.
  * T-SQL statements that are not SQL statements (IF, WHILE, WAITFOR, DECLARE, etc.) must be compiled even though they are not optimized.
  * Populate our plan caches (including stub plans).

Detailed (re)compilation flowchart on http://technet.microsoft.com/en-us/library/cc966425.aspx, saved to compilation-flowchart.gif.

EXECUTION
  * Finally, SQL Server knows what to do with your query and will run it.
  * Multiple requests can use the same execution plan at the same time. 
    * Each one will get a separate execution context, which will include state data necessary to keep that particular execution distinct from all others.

PLAN REUSE

The first thing to realize is that SQL Server is lazy, or at least it tries to be.  But before we understand what it means to be lazy, we need to If you have run the same query before, it really wants to recognize that fact, and skip most of this process so that it can jump straight to execution.

How can it do that?  Let's look at some examples.

SQL Server keeps track of metadata that can tell us what statements have been run.  Let me start by clearing out that metadata.

-- Switch to demo

-- Back from demo

DIFFERENT KINDS OF REQUESTS CAN BE CACHED DIFFERENTLY

-- ad-hoc request (SqlCommand.CommandType = "text"; ad-hoc query run from a dedicated database client like SSMS, sqlcmd.exe, etc.)
-- requests submitted via EXEC ('command') (special case of ad-hoc request)
-- auto-parameterized queries (either FORCED or SIMPLE)
-- prepared requests from an API (SqlCommand.Prepare() method; ODBC SQLPrepare() method; etc.
-- requests run using sp_executeSql (basically a special case of prepared requests from an API)
-- stored procedures 
-- triggers (special case of a stored procedure)
-- ad-hoc batches of multiple commands

REQUESTS CAN BE CACHED AT DIFFERENT LEVELS:

-- individual statements
-- entire batches

TAKEAWAYS

You should try to write database requests that are as efficient as possible.  
-- Use indexes judiciously
-- Write sargable queries
-- Minimize unneeded sorts
-- Choose clustered indexes wisely

You can have a noticeable impact on CPU usage by planning for query plan reuse
-- I have seen PCO times sometimes run ten SECONDS on a very capable server but for a very awkward query.  Don't want to do that on every call.
-- For fast queries, the PCO times can be longer than the execution time.  
  -- This means that if QueryA has plan reuse but QueryB has to recompile then QueryA can finish executing before QueryB even begins!

Tools for evaluating what's going on:
-- SQL Profiler.  Great for showing CPU, Read, Write, Duration, and Recompile activity.  Can do a lot more.  Use with extreme caution on production systems
-- Graphical ShowPlan in SQL Server Management Studio.  This is the best showplan utility I have ever seen on any database platform.  Simply wonderful.
-- Dynamic Management Views and Functions.  In these demos, we have made extensive use of sys.dm_exec_cached_plans and sys.dm_exec_sql_text, but there are lots more.

THINGS TO READ:

-- Web articles:  go to http://delicious.com/chris.leonard/iowacodecamp .  
  -- How easy is that?  If you find others, let me know (chris@databaseguy.com) and I will add them.
-- Books:  
  -- the one that matches best with today's talk is probably "Inside SQL Server 2005: Query Tuning and Optimization" by Kalen Delaney.
  -- for further insight into IO processing, read "Inside SQL Server 2005: The Storage Engine," also by Kalen Delaney.
  -- for a broader internals perspective in a single book, choose "Microsoft SQL Server 2008 Internals" by Kalen Delaney, Kimberly Tripp, Paul Randal, Adam Machanic, and Conor Cunningham.
-- Plenty of good stuff in the SQL Server Books Online, MSDN, and Technet - use Google / Bing and go for it.

-- Fodder:
Parameter peeking - how this can go wrong
Statistics
SET options can lead to different plans